Today, we will start creating beautiful visualization with R.
tidyverse packageFirst, let’s load the tidyverse library, using the command library(tidyverse):
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.2
## -- Attaching packages ------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.0
## v tibble 2.0.1 v dplyr 0.8.0.1
## v tidyr 0.8.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'tidyr' was built under R version 3.5.2
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## -- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
dplyrTo manage data frames, we will use the dplyr library, already downloaded with tidyverse.
dplyr GrammarSome of the key “verbs” provided by the dplyr package are
select: return a subset of the columns of a data frame, using a flexible notationfilter: extract a subset of rows from a data frame based on logical conditionsarrange: reorder rows of a data framerename: rename variables in a data framemutate: add new variables/columns or transform existing variablessummarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata%>%: the “pipe” operator is used to connect multiple verb actions together into a pipelineTo present the next examples, we will be using a dataset containing air pollution and temperature data for the city of Chicago in the U.S.
chicago <- readRDS("chicago.rds")
dim(chicago)
## [1] 6940 8
str(chicago)
## 'data.frame': 6940 obs. of 8 variables:
## $ city : chr "chic" "chic" "chic" "chic" ...
## $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
## $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
## $ date : Date, format: "1987-01-01" "1987-01-02" ...
## $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
## $ pm10tmean2: num 34 NA 34.2 47 NA ...
## $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
## $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
select()The select() function can be used to select columns of a data frame.
Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly.
names(chicago)[1:3]
## [1] "city" "tmpd" "dptp"
subset <- select(chicago, city:dptp)
head(subset)
## city tmpd dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125
You can also omit variables using the select() function by using the negative sign.
subset_omit <- select(chicago, -(city:dptp))
head(subset_omit)
## date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01 NA 34.00000 4.250000 19.98810
## 2 1987-01-02 NA NA 3.304348 23.19099
## 3 1987-01-03 NA 34.16667 3.333333 23.81548
## 4 1987-01-04 NA 47.00000 4.375000 30.43452
## 5 1987-01-05 NA NA 4.750000 30.33333
## 6 1987-01-06 NA 48.00000 5.833333 25.77233
The select() function also allows a special syntax that allows you to specify variable names based on patterns.
To keep every variable that ends with a “2”, we could do
subset_vars_end2 <- select(chicago, ends_with("2"))
str(subset_vars_end2)
## 'data.frame': 6940 obs. of 4 variables:
## $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
## $ pm10tmean2: num 34 NA 34.2 47 NA ...
## $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
## $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
To keep every variable that starts with a “d”, we could do
subset_vars_startd <- select(chicago, starts_with("d"))
str(subset_vars_startd)
## 'data.frame': 6940 obs. of 2 variables:
## $ dptp: num 31.5 29.9 27.4 28.6 28.9 ...
## $ date: Date, format: "1987-01-01" "1987-01-02" ...
The filter() function is used to extract subsets of rows from a data frame. This function is similar to the existing subset() function in R but is quite a bit faster.
Suppose we wanted to extract the rows of the chicago data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level), we could do
chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)
## 'data.frame': 194 obs. of 8 variables:
## $ city : chr "chic" "chic" "chic" "chic" ...
## $ tmpd : num 23 28 55 59 57 57 75 61 73 78 ...
## $ dptp : num 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
## $ date : Date, format: "1998-01-17" "1998-01-23" ...
## $ pm25tmean2: num 38.1 34 39.4 35.4 33.3 ...
## $ pm10tmean2: num 32.5 38.7 34 28.5 35 ...
## $ o3tmean2 : num 3.18 1.75 10.79 14.3 20.66 ...
## $ no2tmean2 : num 25.3 29.4 25.3 31.4 26.8 ...
summary(chic.f$pm25tmean2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.05 32.12 35.04 36.63 39.53 61.50
We can place an arbitrarily complex logical sequence inside of filter(), so we could for example extract the rows where PM2.5 is greater than 30 and temperature is greater than 80 degrees Fahrenheit.
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)
## date tmpd pm25tmean2
## 1 1998-08-23 81 39.60000
## 2 1998-09-06 81 31.50000
## 3 2001-07-20 82 32.30000
## 4 2001-08-01 84 43.70000
## 5 2001-08-08 85 38.83750
## 6 2001-08-09 84 38.20000
## 7 2002-06-20 82 33.00000
## 8 2002-06-23 82 42.50000
## 9 2002-07-08 81 33.10000
## 10 2002-07-18 82 38.85000
## 11 2003-06-25 82 33.90000
## 12 2003-07-04 84 32.90000
## 13 2005-06-24 86 31.85714
## 14 2005-06-27 82 51.53750
## 15 2005-06-28 85 31.20000
## 16 2005-07-17 84 32.70000
## 17 2005-08-03 84 37.90000
arrange()The arrange() function is used to reorder rows of a data frame according to one of the variables/columns.
We can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.
head(chicago, 10)
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810
## 2 chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099
## 3 chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548
## 4 chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452
## 5 chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333
## 6 chic 40.0 35.125 1987-01-06 NA 48.00000 5.833333 25.77233
## 7 chic 34.5 26.750 1987-01-07 NA 41.00000 9.291667 20.58171
## 8 chic 29.0 22.000 1987-01-08 NA 36.00000 11.291667 17.03723
## 9 chic 26.5 29.000 1987-01-09 NA 33.28571 4.500000 23.38889
## 10 chic 32.5 27.750 1987-01-10 NA NA 4.958333 19.54167
chicago <- arrange(chicago, date)
We can now check the first few rows
head(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 1 1987-01-01 NA
## 2 1987-01-02 NA
## 3 1987-01-03 NA
and the last few rows.
tail(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 6938 2005-12-29 7.45000
## 6939 2005-12-30 15.05714
## 6940 2005-12-31 15.00000
Columns can be arranged in descending order too by using the special desc() operator.
chicago <- arrange(chicago, desc(date))
Looking at the first three and last three rows shows the dates in descending order.
head(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 1 2005-12-31 15.00000
## 2 2005-12-30 15.05714
## 3 2005-12-29 7.45000
tail(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 6938 1987-01-03 NA
## 6939 1987-01-02 NA
## 6940 1987-01-01 NA
rename()The rename() function is designed to rename a variable in a data frame.
Here you can see the names of the first five variables in the chicago data frame.
head(chicago[, 1:5], 3)
## city tmpd dptp date pm25tmean2
## 1 chic 35 30.1 2005-12-31 15.00000
## 2 chic 36 31.0 2005-12-30 15.05714
## 3 chic 35 29.4 2005-12-29 7.45000
The dptp column is supposed to represent the dew point temperature and the pm25tmean2 column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably be renamed to something more sensible.
chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago[, 1:5], 3)
## city tmpd dewpoint date pm25
## 1 chic 35 30.1 2005-12-31 15.00000
## 2 chic 36 31.0 2005-12-30 15.05714
## 3 chic 35 29.4 2005-12-29 7.45000
The syntax inside the rename() function is to have the new name on the left-hand side of the = sign and the old name on the right-hand side.
mutate()The mutate() function exists to compute transformations of variables in a data frame.
For example, with air pollution data, we often want to detrend the data by subtracting the mean from the data. That way we can look at whether a given day’s air pollution level is higher than or less than average (as opposed to looking at its absolute level).
Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.
chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago)
## city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000
## 2 chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556
## 3 chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222
## 4 chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563
## 5 chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000
## 6 chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944
## pm25detrend
## 1 -1.230958
## 2 -1.173815
## 3 -8.780958
## 4 1.519042
## 5 7.329042
## 6 -7.830958
group_by()The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable. For example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is. So the stratum is the year, and that is something we can derive from the date variable. In conjunction with the group_by() function we often use the summarize() function.
The general operation here is a combination of splitting a data frame into separate pieces defined by a variable or group of variables (group_by()), and then applying a summary function across those subsets (summarize()).
First, we can create a year varible using as.POSIXlt().
chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
head(chicago)
## city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000
## 2 chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556
## 3 chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222
## 4 chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563
## 5 chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000
## 6 chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944
## pm25detrend year
## 1 -1.230958 2005
## 2 -1.173815 2005
## 3 -8.780958 2005
## 4 1.519042 2005
## 5 7.329042 2005
## 6 -7.830958 2005
Now we can create a separate data frame that splits the original data frame by year.
years <- group_by(chicago, year)
Finally, we compute summary statistics for each year in the data frame with the summarize() function.
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
## # A tibble: 19 x 4
## year pm25 o3 no2
## <dbl> <dbl> <dbl> <dbl>
## 1 1987 NaN 63.0 23.5
## 2 1988 NaN 61.7 24.5
## 3 1989 NaN 59.7 26.1
## 4 1990 NaN 52.2 22.6
## 5 1991 NaN 63.1 21.4
## 6 1992 NaN 50.8 24.8
## 7 1993 NaN 44.3 25.8
## 8 1994 NaN 52.2 28.5
## 9 1995 NaN 66.6 27.3
## 10 1996 NaN 58.4 26.4
## 11 1997 NaN 56.5 25.5
## 12 1998 18.3 50.7 24.6
## 13 1999 18.5 57.5 24.7
## 14 2000 16.9 55.8 23.5
## 15 2001 16.9 51.8 25.1
## 16 2002 15.3 54.9 22.7
## 17 2003 15.2 56.2 24.6
## 18 2004 14.6 44.5 23.4
## 19 2005 16.2 58.8 22.6
In a slightly more complicated example, we might want to know what are the average levels of ozone (o3) and nitrogen dioxide (no2) within quintiles of pm25. A slicker way to do this would be through a regression model, but we can actually do this quickly with group_by() and summarize().
First, we can create a categorical variable of pm25 divided into quintiles.
qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
chicago <- mutate(chicago, pm25.quint = cut(pm25, qq))
head(chicago)
## city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000
## 2 chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556
## 3 chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222
## 4 chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563
## 5 chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000
## 6 chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944
## pm25detrend year pm25.quint
## 1 -1.230958 2005 (12.4,16.7]
## 2 -1.173815 2005 (12.4,16.7]
## 3 -8.780958 2005 (1.7,8.7]
## 4 1.519042 2005 (16.7,22.6]
## 5 7.329042 2005 (22.6,61.5]
## 6 -7.830958 2005 (1.7,8.7]
Now we can group the data frame by the pm25.quint variable.
quint <- group_by(chicago, pm25.quint)
## Warning: Factor `pm25.quint` contains implicit NA, consider using
## `forcats::fct_explicit_na`
head(quint)
## # A tibble: 6 x 11
## # Groups: pm25.quint [4]
## city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2
## <chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2
## 2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8
## 3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0
## 4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3
## 5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5
## 6 chic 35 29.6 2005-12-26 8.4 8.5 14.0 16.8
## # ... with 3 more variables: pm25detrend <dbl>, year <dbl>,
## # pm25.quint <fct>
Finally, we can compute the mean of o3 and no2 within quintiles of pm25.
summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
no2 = mean(no2tmean2, na.rm = TRUE))
## # A tibble: 6 x 3
## pm25.quint o3 no2
## <fct> <dbl> <dbl>
## 1 (1.7,8.7] 21.7 18.0
## 2 (8.7,12.4] 20.4 22.1
## 3 (12.4,16.7] 20.7 24.4
## 4 (16.7,22.6] 19.9 27.3
## 5 (22.6,61.5] 20.3 29.6
## 6 <NA> 18.8 25.8
From the table, it seems there isn’t a strong relationship between pm25 and o3, but there appears to be a positive correlation between pm25 and no2. More sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of dplyr functions can often get you most of the way there.
%>%The pipeline operater %>% is very handy for stringing together multiple dplyr functions in a sequence of operations. Notice above that every time we wanted to apply more than one function, the sequence gets buried in a sequence of nested function calls that is difficult to read, i.e.
third(second(first(x)))
This nesting is not a natural way to think about a sequence of operations. The %>% operator allows you to string operations in a left-to-right fashion, i.e.
first(x) %>% second %>% third
Take the example that we just did in the last section where we computed the mean of o3 and no2 within quintiles of pm25. There we had to
That can be done with the following sequence in a single R expression.
mutate(chicago, pm25.quint = cut(pm25, qq)) %>%
group_by(pm25.quint) %>%
summarize(o3 = mean(o3tmean2, na.rm = TRUE),
no2 = mean(no2tmean2, na.rm = TRUE))
## Warning: Factor `pm25.quint` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 6 x 3
## pm25.quint o3 no2
## <fct> <dbl> <dbl>
## 1 (1.7,8.7] 21.7 18.0
## 2 (8.7,12.4] 20.4 22.1
## 3 (12.4,16.7] 20.7 24.4
## 4 (16.7,22.6] 19.9 27.3
## 5 (22.6,61.5] 20.3 29.6
## 6 <NA> 18.8 25.8
This way we don’t have to create a set of temporary variables along the way or create a massive nested sequence of function calls.
Notice in the above code that we pass the chicago data frame to the first call to mutate(), but then afterwards we do not have to pass the first argument to group_by() or summarize(). Once you travel down the pipeline with %>%, the first argument is taken to be the output of the previous element in the pipeline.
Another example might be computing the average pollutant level by month. This could be useful to see if there are any seasonal trends in the data.
mutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%
group_by(month) %>%
summarize(pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
## # A tibble: 12 x 4
## month pm25 o3 no2
## <dbl> <dbl> <dbl> <dbl>
## 1 1 17.8 28.2 25.4
## 2 2 20.4 37.4 26.8
## 3 3 17.4 39.0 26.8
## 4 4 13.9 47.9 25.0
## 5 5 14.1 52.8 24.2
## 6 6 15.9 66.6 25.0
## 7 7 16.6 59.5 22.4
## 8 8 16.9 54.0 23.0
## 9 9 15.9 57.5 24.5
## 10 10 14.2 47.1 24.2
## 11 11 15.2 29.5 23.6
## 12 12 17.5 27.7 24.5
Here we can see that o3 tends to be low in the winter months and high in the summer while no2 is higher in the winter and lower in the summer.
ggplot2The package ggplot2 is also available with the tidyverse library
head(chicago)
## city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2
## 1 chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000
## 2 chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556
## 3 chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222
## 4 chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563
## 5 chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000
## 6 chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944
## pm25detrend year pm25.quint
## 1 -1.230958 2005 (12.4,16.7]
## 2 -1.173815 2005 (12.4,16.7]
## 3 -8.780958 2005 (1.7,8.7]
## 4 1.519042 2005 (16.7,22.6]
## 5 7.329042 2005 (22.6,61.5]
## 6 -7.830958 2005 (1.7,8.7]
To summarize continuous variable using a series of bars, we need to divide the observations into groups, or bins, and count how many are in each one.
By default, the geom_histogram() function will choose a bin size for us based on a rule of thumb.
p <- ggplot(data = chicago,
mapping = aes(x = pm25))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4447 rows containing non-finite values (stat_bin).
p <- ggplot(data = chicago,
mapping = aes(x = pm25))
p + geom_histogram(bins = 10)
## Warning: Removed 4447 rows containing non-finite values (stat_bin).
Change line color and fill color
p <- ggplot(data = chicago,
mapping = aes(x=pm25)) +
geom_histogram(color="darkblue", fill="lightblue")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4447 rows containing non-finite values (stat_bin).
Change line type
p <- ggplot(data = chicago,
mapping = aes(x=pm25)) +
geom_histogram(color="black", fill="lightblue",
linetype="dashed")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4447 rows containing non-finite values (stat_bin).
p <- ggplot(data = chicago,
mapping = aes(x=pm25)) +
geom_histogram(color="black", fill="pink",
linetype="dashed") +
labs(title="PM2.5 histogram plot",x="PM2.5", y = "Count")+
theme_classic()
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4447 rows containing non-finite values (stat_bin).
It’s also possible to use several at once to compare distributions. We can facet histograms by some variable of interest, or as here we can compare them in the same plot using the fill mapping.
pm25.quint_ <- unique(na.omit(chicago$pm25.quint))
pm25.quint_
## [1] (12.4,16.7] (1.7,8.7] (16.7,22.6] (22.6,61.5] (8.7,12.4]
## Levels: (1.7,8.7] (8.7,12.4] (12.4,16.7] (16.7,22.6] (22.6,61.5]
p <- ggplot(data = subset(chicago, subset = pm25.quint %in% pm25.quint_),
mapping = aes(x = dewpoint, fill = pm25.quint))
p + geom_histogram(alpha = 0.4, bins = 30)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
When working with a continuous variable, an alternative to binning the data and making a histogram is to calculate a kernel density estimate of the underlying distribution. The geom_density() function will do this for us.
p <- ggplot(data = subset(chicago, subset = pm25.quint %in% pm25.quint_),
mapping = aes(x = dewpoint, fill = pm25.quint, color = pm25.quint))
p + geom_density(alpha = 0.3)
## Warning: Removed 2 rows containing non-finite values (stat_density).